10 research outputs found

    A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512 on Intel Skylake

    The modern CPU's design, which includes hierarchical memory and SIMD/vectorization capabilities, governs the potential for algorithms to be transformed into efficient implementations. The release of AVX-512 changed things radically and motivated us to search for an efficient sorting algorithm that can take advantage of it. In this paper, we describe the best strategy we have found, which is a novel two-part hybrid sort based on the well-known Quicksort algorithm. The central partitioning operation is performed by a new algorithm, and small partitions/arrays are sorted using a branch-free Bitonic-based sort. This study is also an illustration of how classical algorithms can be adapted to and enhanced by the AVX-512 extension. We evaluate the performance of our approach on a modern Intel Xeon Skylake and assess the different layers of our implementation by sorting/partitioning integers, double-precision floating-point numbers, and key/value pairs of integers. Our results demonstrate that our approach is faster than two reference libraries: the GNU C++ sort algorithm, by a speedup factor of 4, and the Intel IPP library, by a speedup factor of 1.4. Comment: 8 pages, research paper.
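
    To make the vectorized-partitioning idea concrete, here is a minimal sketch of one AVX-512 partitioning pass built on compress-store instructions. It is not the paper's in-place algorithm: it writes to a separate output buffer, assumes the length is a multiple of 16, and the function name is ours.

```cpp
#include <immintrin.h>
#include <cstddef>

// Sketch: partition n 32-bit ints around a pivot with AVX-512.
// Values < pivot are packed at the front of out[], the rest at the back.
// Returns the split index. Assumes n is a multiple of 16 for brevity.
std::size_t partition_avx512(const int* in, int* out, std::size_t n, int pivot) {
    const __m512i piv = _mm512_set1_epi32(pivot);
    std::size_t lo = 0, hi = n;
    for (std::size_t i = 0; i < n; i += 16) {
        const __m512i v = _mm512_loadu_si512(in + i);
        const __mmask16 lt = _mm512_cmplt_epi32_mask(v, piv); // lanes < pivot
        const unsigned nlt = _mm_popcnt_u32(lt);              // number of low lanes
        _mm512_mask_compressstoreu_epi32(out + lo, lt, v);    // pack lows to the left
        hi -= 16 - nlt;
        _mm512_mask_compressstoreu_epi32(out + hi,            // pack highs to the right
                                         static_cast<__mmask16>(~lt), v);
        lo += nlt;
    }
    return lo; // out[0..lo) < pivot <= out[lo..n)
}
```

    A production version would also handle lengths that are not multiples of 16 and avoid the auxiliary buffer by working in place.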

    Computing the sparse matrix vector product using block-based kernels without zero padding on processors with AVX-512 instructions

    The sparse matrix-vector product (SpMV) is a fundamental operation in many scientific applications from various fields. The High Performance Computing (HPC) community has therefore continuously invested significant effort in providing an efficient SpMV kernel on modern CPU architectures. Although it has been shown that block-based kernels help to achieve high performance, they are difficult to use in practice because of the zero padding they require. In the current paper, we propose new kernels using the AVX-512 instruction set, which make it possible to use a blocking scheme without any zero padding in the matrix memory storage. We describe mask-based sparse matrix formats and their corresponding SpMV kernels, highly optimized in assembly language. Considering that the optimal block size depends on the matrix, we also provide a method to predict the best kernel to be used, based on a simple interpolation of results from previous executions. We compare the performance of our approach to that of the Intel MKL CSR kernel and the CSR5 open-source package on a set of standard benchmark matrices. We show that we can achieve significant improvements in many cases, both for sequential and for parallel executions. Finally, we provide the corresponding code in an open-source library called SPC5. Comment: Published in PeerJ Computer Science.
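
    To illustrate the padding-free blocking idea, below is a minimal sketch of a 1x8 row-block kernel in the spirit described above. The Block1x8 layout, the names, and the single accumulator per row are our illustrative assumptions, not the SPC5 data structures.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

struct Block1x8 {              // hypothetical block descriptor
    std::uint32_t col;         // first column covered by the block
    std::uint8_t  mask;        // bit i set => entry (row, col+i) is nonzero
};

// Computes the contribution of one row's blocks to y[row]:
// sum over blocks of dot(block nonzeros, x[col..col+7]).
// packed_values holds only the nonzeros, with no zero padding;
// x is assumed readable up to col+7 for every block (e.g. padded).
double row_block_spmv(const Block1x8* blocks, std::size_t nb_blocks,
                      const double* packed_values, const double* x) {
    __m512d acc = _mm512_setzero_pd();
    const double* vals = packed_values;
    for (std::size_t b = 0; b < nb_blocks; ++b) {
        const __mmask8 m = blocks[b].mask;
        // Expand the packed nonzeros into their lanes (zeros elsewhere).
        const __m512d v  = _mm512_maskz_expandloadu_pd(m, vals);
        const __m512d xv = _mm512_loadu_pd(x + blocks[b].col);
        acc = _mm512_fmadd_pd(v, xv, acc);
        vals += _mm_popcnt_u32(m);        // advance by the nonzero count
    }
    return _mm512_reduce_add_pd(acc);
}
```

    The expand-load is what removes the padding requirement: only popcount(mask) values are read from memory, yet the multiply-add still operates on full vectors.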

    Parallelization of the Lattice-Boltzmann schemes using the task-based method

    The popularization of graphics processing units (GPUs) has led to their extensive use in high-performance numerical simulations. The Lattice Boltzmann Method (LBM) is a general framework for constructing efficient numerical fluid simulations. In this scheme, the fluid quantities are approximated on a structured grid. At each time step, a shift-relaxation process is applied, where each kinetic value is shifted to the corresponding direction in the lattice. Thanks to its simplicity, the LBM lends itself to many software optimizations. State-of-the-art techniques aim at adapting the LBM scheme to improve the computational throughput on modern processors. Currently, most effort is put into optimizing this process on GPUs, as their architecture is highly suited to this type of computation. A bottleneck of GPU implementations is that the data size of the simulation is limited by the GPU memory. This restricts the number of volume elements and, therefore, the degree of precision one can obtain. In this work, we divide the lattice structure into multiple subsets that can be executed individually. This allows the work to be distributed among different processing units at the cost of increased complexity and memory transfers, but the constraint on GPU memory is relaxed, as the subsets can be made as small as needed. Additionally, we use the task-based approach to parallelize the application, which allows the computation to be efficiently distributed among multiple processing units.
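
    As a rough illustration of the described decomposition, the sketch below splits the lattice into blocks and runs each step's halo exchange and update phases as OpenMP tasks. The Block type and the kernels are placeholders of ours; a real task-based version would typically use data dependencies rather than taskwait barriers.

```cpp
#include <cstddef>
#include <vector>

struct Block {
    std::vector<double> f; // distribution functions of one lattice subset
};

// Placeholder kernels: a real code would copy boundary layers between
// neighboring blocks and apply the LBM shift-relaxation update.
void exchange_halo(Block&, std::vector<Block>&) { /* ... */ }
void collide_and_stream(Block&) { /* ... */ }

void lbm_run(std::vector<Block>& blocks, int nb_steps) {
    #pragma omp parallel
    #pragma omp single
    for (int t = 0; t < nb_steps; ++t) {
        for (std::size_t i = 0; i < blocks.size(); ++i) {
            #pragma omp task firstprivate(i) shared(blocks)
            exchange_halo(blocks[i], blocks);
        }
        #pragma omp taskwait // all halos ready before the update
        for (std::size_t i = 0; i < blocks.size(); ++i) {
            #pragma omp task firstprivate(i) shared(blocks)
            collide_and_stream(blocks[i]);
        }
        #pragma omp taskwait // finish the step before starting the next
    }
}
```

    Because each task touches only one subset, a runtime can schedule subsets on different processing units, and a device only ever needs to hold the subsets currently assigned to it.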

    Bridging the performance gap between OpenMP 4.0 and runtime systems for the fast multipole method

    With the advent of complex modern architectures, the low-level paradigms long considered sufficient to build High Performance Computing (HPC) numerical codes have met their limits. Achieving efficiency and ensuring portability while preserving programming tractability on such hardware prompted the HPC community to design new, higher-level paradigms. The successful ports of fully-featured numerical libraries on several recent runtime system proposals have indeed shown the benefit of task-based parallelism models in terms of performance portability on complex platforms. However, the common weakness of these projects is to deeply tie applications to specific expert-only runtime system APIs. The OpenMP specification, which aims at providing a common parallel programming means for shared-memory platforms, appears as a good candidate to address this issue thanks to the latest task-based constructs introduced in its revision 4.0. The goal of this paper is to assess the effectiveness and limits of this support for designing a high-performance numerical library. We illustrate our discussion with the ScalFMM library, which implements state-of-the-art fast multipole method (FMM) algorithms, and which we have deeply redesigned with respect to the most advanced features provided by OpenMP 4. We show that OpenMP 4 allows for significant performance improvements over previous OpenMP revisions on recent multicore processors. We furthermore propose extensions to the OpenMP 4 standard and show how they can enhance FMM performance. To assess our statement, we have implemented this support within the Klang-Omp source-to-source compiler, which translates OpenMP directives into calls to the StarPU task-based runtime system. This study shows that we can take advantage of the advanced capabilities of a fully-featured runtime system without resorting to a specific, native runtime port, hence bridging the gap between the OpenMP standard and the very high performance that was so far reserved to expert-only runtime system APIs.
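
    For readers unfamiliar with the OpenMP 4.0 task constructs this work builds on, here is a toy example (not ScalFMM code) showing how depend clauses order producer and consumer tasks, the mechanism used to express dependencies between FMM operators. The "P2M"/"M2L" labels are only loose analogies.

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n = 8;
    std::vector<double> multipole(n, 1.0), local(n, 0.0);
    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < n; ++i) {
            double* m = &multipole[i];
            double* l = &local[i];
            #pragma omp task depend(out: m[0])   // producer ("P2M"-like)
            m[0] *= 2.0;
            // The consumer runs only after the producer above completes,
            // because both reference the same storage location m[0].
            #pragma omp task depend(in: m[0]) depend(inout: l[0])
            l[0] += m[0];
        }
    } // implicit barrier: all tasks are finished here
    std::printf("local[0] = %f\n", local[0]); // prints 2.000000
    return 0;
}
```

    A source-to-source compiler such as the one described above can map each such task and its depend clauses onto the task-submission API of a runtime system like StarPU.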

    A fast vectorized sorting implementation based on the ARM scalable vector extension (SVE)

    The way developers implement their algorithms and how these implementations behave on modern CPUs are governed by the design and organization of these CPUs. The vectorization units (SIMD) are among the few CPU components that can and must be explicitly controlled. In the HPC community, the x86 CPUs and their vectorization instruction sets were the de facto standard for decades. Each new release of an instruction set usually doubled the vector length and added new operations, and each generation pushed for adapting and improving previous implementations. The release of the ARM scalable vector extension (SVE) changed things radically for several reasons. First, we expect ARM processors to equip many supercomputers in the coming years. Second, SVE's interface differs in several aspects from the x86 extensions: it provides different instructions, uses a predicate to control most operations, and has a vector size that is only known at execution time. Therefore, using SVE opens new challenges for adapting algorithms, including those that are already well optimized on x86. In this paper, we port a hybrid sort based on the well-known Quicksort and Bitonic-sort algorithms. We use a Bitonic sort to process small partitions/arrays and a vectorized partitioning implementation to divide the partitions. We explain how we use the predicates and how we manage the non-static vector size. We also explain how we efficiently implement the sorting kernels. Our approach only needs an array of size O(log N) for the recursive calls in the partitioning phase, both in the sequential and in the parallel case. We test the performance of our approach on a modern ARMv8.2 (A64FX) CPU and assess the different layers of our implementation by sorting/partitioning integers, double-precision floating-point numbers, and key/value pairs of integers. Our results show that our approach is faster than the GNU C++ sort algorithm by a speedup factor of 4 on average.
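
    To illustrate the predicate-driven, length-agnostic style the abstract refers to, here is a minimal out-of-place partitioning sketch using SVE ACLE intrinsics. It is our simplified illustration, not the paper's in-place kernel.

```cpp
#include <arm_sve.h> // requires an SVE-enabled compiler, e.g. -march=armv8.2-a+sve
#include <cstddef>
#include <cstdint>

// Sketch: partition n ints around pivot into out[] (lows left, highs right).
// The vector length svcntw() is only known at run time; whilelt predicates
// handle the final partial vector, as SVE encourages.
std::size_t partition_sve(const std::int32_t* in, std::int32_t* out,
                          std::size_t n, std::int32_t pivot) {
    const svint32_t piv = svdup_n_s32(pivot);
    std::size_t lo = 0, hi = n;
    for (std::size_t i = 0; i < n; i += svcntw()) {
        const svbool_t pg = svwhilelt_b32((std::uint64_t)i, (std::uint64_t)n);
        const svint32_t v = svld1_s32(pg, in + i);
        const svbool_t lt = svcmplt_s32(pg, v, piv); // active lanes < pivot
        const svbool_t ge = svnot_b_z(pg, lt);       // active lanes >= pivot
        const std::uint64_t nlt = svcntp_b32(pg, lt);
        const std::uint64_t nge = svcntp_b32(pg, ge);
        // svcompact packs the selected lanes to the front of the vector.
        svst1_s32(svwhilelt_b32(std::uint64_t(0), nlt), out + lo,
                  svcompact_s32(lt, v));
        hi -= nge;
        svst1_s32(svwhilelt_b32(std::uint64_t(0), nge), out + hi,
                  svcompact_s32(ge, v));
        lo += nlt;
    }
    return lo;
}
```

    Note how every memory access and comparison is governed by a predicate, so the same loop works for any hardware vector length without a scalar tail loop.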

    Towards EXtreme scale technologies and accelerators for euROhpc hw/Sw supercomputing applications for exascale: The TEXTAROSSA approach

    In the near future, Exascale systems will need to bridge three technology gaps to achieve high performance while remaining under tight power constraints: energy efficiency and thermal control; extreme computation efficiency via HW acceleration and new arithmetic; and methods and tools for seamless integration of reconfigurable accelerators in heterogeneous HPC multi-node platforms. TEXTAROSSA addresses these gaps through a co-design approach to heterogeneous HPC solutions, supported by the integration and extension of HW and SW IPs, programming models, and tools derived from European research.

    TEXTAROSSA: Towards EXtreme scale Technologies and Accelerators for euROhpc hw/Sw Supercomputing Applications for exascale

    To achieve high performance and high energy efficiency on near-future exascale computing systems, three key technology gaps need to be bridged: energy efficiency and thermal control; extreme computation efficiency via HW acceleration and new arithmetic; and methods and tools for seamless integration of reconfigurable accelerators in heterogeneous HPC multi-node platforms. TEXTAROSSA aims at tackling these gaps through a co-design approach to heterogeneous HPC solutions, supported by the integration and extension of HW and SW IPs, programming models, and tools derived from European research.